The S&P 500 is an American stock market index based on the market capitalizations of 500 large companies with common stock listed on the NYSE, NASDAQ, or the Cboe BZX Exchange. It is one of the most closely followed equity indices and one of the best representations of the U.S. stock market. Many consider it a bellwether and leading indicator for the economy, as well as the default vehicle for passive investors who want exposure to the U.S. economy via index funds. Since 1957, the S&P 500 has performed remarkably, outpacing other major asset classes such as bonds and commodities.
In this project, we examine models that help predict the performance of the S&P 500 index. This would allow investors (portfolio managers as well as individual investors) to gauge the general economic conditions and investment climate given current market conditions.
We will include both quantitative and qualitative inputs in the model. Quantitative inputs include Fama-French three-factor variables (SMB, HML, market risk premium), macroeconomic indicators (e.g. US GDP, unemployment rate, inflation index), technical analysis indicators (e.g. relative strength index, Z-score, accumulation) and indices for other financial assets or sectors (bond prices, indices for emerging markets). Apart from these, market sentiment is used as a qualitative input, which can be obtained through tweets.
The first model we have developed nowcasts the current US GDP. Real GDP is widely recognized as the best economic indicator for measuring the state of the economy, and thus it has a large impact on the S&P 500. However, GDP figures for the current quarter are only released 3 months after the end of the quarter, limiting investors' decision making. Therefore, it would be of great value to nowcast the current GDP and use it in our subsequent regression analysis.
The second model is a text analytics model, in which we conduct sentiment analysis on tweets from 300 influential finance Twitter accounts and use the resulting sentiment values to predict S&P 500 performance.
Then, we use regression to determine the model that gives the highest accuracy in predicting the S&P 500.
Lastly, the team will simulate possible economic conditions in 2019 and use our final model to assess the likelihood of a market crash or boom in 2019.
Motivation: Real GDP is widely recognized as the best economic indicator for measuring the state of the economy. Consequently, GDP figures can have a large impact on the stock market. However, GDP figures for the current quarter are only released 3 months after the end of the quarter, limiting investors' decision making. Other economic indicators, though, are released monthly, and the information they provide may be useful in informing us of the current state of the economy.
The first step involves collecting and processing the data for the various economic indicators.
library(timeDate)
library(DataCombine)
#read in multiple csv files
temp = list.files("./Nowcast Data", pattern="*.csv")
myfiles = lapply(temp, function(x) read.csv(paste("./Nowcast Data/", x, sep=""),stringsAsFactors = FALSE))
names(myfiles) <- gsub(".csv","",
list.files("./Nowcast Data",full.names = FALSE),
fixed = TRUE)
#function to convert string data into numeric values and align the dates of the various economic indicators
readAndClean <- function(originalData) {
histData <- originalData
histData[ ,c('Forecast', 'Time')] <- list(NULL)
histData$Actual <- as.numeric(gsub(pattern = '%|M|K',replacement = '',x = histData$Actual))
histData$Month <- match(substr(x = histData$Release.Date,start = 1,stop=3),month.abb)-1
histData$Year <- ifelse(histData$Month==0,as.numeric(substr(x = histData$Release.Date,start=9,stop=13))-1,as.numeric(substr(x = histData$Release.Date,start=9,stop=13)))
histData$Month <- ifelse(histData$Month==0,histData$Month+12,histData$Month)
histData$Month <- ifelse(histData$Month<10,paste0("0",histData$Month),histData$Month)
histData$Date <- timeLastDayInMonth(as.Date(as.character(paste0("01-",histData$Month,"-",histData$Year)),"%d-%m-%Y"))
histData <- histData[,-c(1,3,4)]
histData <- histData[c(2,1)]
VariableName <- deparse(substitute(originalData))
#Replace the whole name with an empty string so that do.call("cbind", df) below
#prefixes each column with its list name, yielding e.g. CapUtilize._date and CapUtilize.
variableName <- sub('.*', '', VariableName)
colnames(histData) <- c(paste0(variableName,"_date"),variableName)
return (histData)
}
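To illustrate the date convention encoded in readAndClean (release-date strings look like "Mar 08, 2018", and a figure released in month M covers month M-1), here is the same arithmetic applied to a hypothetical release date:

```r
library(timeDate)

rd <- "Mar 08, 2018"                           #hypothetical release-date string
m <- match(substr(rd, 1, 3), month.abb) - 1    #a March release covers February
y <- as.numeric(substr(rd, 9, 12)) - (m == 0)  #a January release covers the prior December
m <- ifelse(m == 0, 12, m)
#cover date = last calendar day of the covered month
timeLastDayInMonth(as.Date(sprintf("01-%02d-%d", m, y), "%d-%m-%Y"))  #2018-02-28
```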
df <- lapply(myfiles,readAndClean)
df <- do.call("cbind",df)
#To be consistent in units
df$CapUtilize. <- df$CapUtilize./100
df$CoreCPI. <- df$CoreCPI./100
df$CorePPI. <- df$CorePPI./100
df$HourlyEarnings. <- df$HourlyEarnings./100
df$RetailSales. <- df$RetailSales./100
df$Unemployment. <- df$Unemployment./100
#Create a new data frame to hold economic variables that are already recorded in terms of MoM change
newdf <- df[,c("CapUtilize._date","CoreCPI.","CorePPI.","HourlyEarnings.","RetailSales.")]
names(newdf)[names(newdf) == 'CapUtilize._date'] <- 'CoverDate'
#Refers to economic indicators that have to be transformed to MoM change
colToChange <- c("CapUtilize.","Unemployment.","BizOutlook.","ConsumerSentiment.","HousingStarts.","ISMPMI.","NFP.")
for (i in seq_along(colToChange)) {
changedDf <- PercChange(df,Var=colToChange[i],type="percent",slideBy=1)
newdf <- cbind(newdf,changedDf[,ncol(changedDf)])
names(newdf)[ncol(newdf)] = colToChange[i]
newdf[,ncol(newdf)] <- newdf[,ncol(newdf)]/100
}
## Warning: PercChange is depricated. Please use change.
##
## Remember to put data in time order before running.
##
## Leading CapUtilize. by 1 time units.
## Warning: PercChange is depricated. Please use change.
##
## Remember to put data in time order before running.
##
## Leading Unemployment. by 1 time units.
## Warning: PercChange is depricated. Please use change.
##
## Remember to put data in time order before running.
##
## Leading BizOutlook. by 1 time units.
## Warning: PercChange is depricated. Please use change.
##
## Remember to put data in time order before running.
##
## Leading ConsumerSentiment. by 1 time units.
## Warning: PercChange is depricated. Please use change.
##
## Remember to put data in time order before running.
##
## Leading HousingStarts. by 1 time units.
## Warning: PercChange is depricated. Please use change.
##
## Remember to put data in time order before running.
##
## Leading ISMPMI. by 1 time units.
## Warning: PercChange is depricated. Please use change.
##
## Remember to put data in time order before running.
##
## Leading NFP. by 1 time units.
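The deprecation warnings point to DataCombine::change as the replacement. Equivalently, since the rows are ordered newest first, the month-over-month fractional change that PercChange(..., type="percent", slideBy=1)/100 produces can be sketched in base R:

```r
#Fractional change of each row relative to the next row (i.e. the previous month);
#the oldest observation has no predecessor, hence NA
mom_change <- function(x) c((head(x, -1) - tail(x, -1)) / tail(x, -1), NA)

mom_change(c(104, 100, 95))  #0.0400 0.0526 NA (approximately)
```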
The next step involves aggregating the monthly data to quarterly frequency so that it can be regressed against quarterly GDP later on. We use the bridge equation detailed in a BOJ paper (https://www.boj.or.jp/en/research/wps_rev/wps_2018/data/wp18e18.pdf) to aggregate the monthly data; the aggregation is a linear combination of the current and lagged values of each indicator.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
newdf1 <- newdf
for (i in 1:nrow(newdf1)) {
if ((month(newdf1$CoverDate[i]) == 3 | month(newdf1$CoverDate[i]) == 6 | month(newdf1$CoverDate[i]) == 9 | month(newdf1$CoverDate[i]) == 12) & year(newdf1$CoverDate[i])>2013) {
for (j in 2:12) {
newdf1[i,j] <- 1/3*(newdf1[i,j] + 2*newdf1[i+1,j] + 3*newdf1[i+2,j] + 2*newdf1[i+3,j] + newdf1[i+4,j])
}
}
}
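The loop above applies the bridge-equation weights (1, 2, 3, 2, 1)/3 to the current and four preceding monthly observations. As a small sanity check with made-up monthly growth rates:

```r
#Quarterly growth implied by five monthly MoM growth rates (weights from the BOJ paper)
x <- c(0.010, 0.020, 0.030, 0.020, 0.010)  #hypothetical values, newest first
(1/3) * sum(c(1, 2, 3, 2, 1) * x)          #0.0633...
```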
After the aggregation, we can remove the monthly data which are not at the quarter end. We consider the data from 2014 - 2018 for our regression. Intuitively, the various economic indicators should have a moderate to strong correlation, considering that they all serve to measure the underlying economic activity. We plot the correlation chart to confirm our intuition.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following objects are masked from 'package:timeDate':
##
## kurtosis, skewness
## The following object is masked from 'package:graphics':
##
## legend
predictors <- newdf1 %>%
select(-1) %>%
slice(seq(4,61,3))
chart.Correlation(predictors, histogram=TRUE)
chart.Boxplot(predictors,ylim=c(-0.5,1.5))
Although the chart does not show strong correlations across all indicators, several pairs exhibit moderate to strong correlation. As such, we will extract principal components to avoid multicollinearity when we perform our regression later on. Additionally, the boxplot reveals the outsized variances of some economic indicators, so we use a standardized principal component analysis.
PCAValues <- prcomp(x = predictors, scale. = TRUE, center = TRUE)
PCAData <- unclass(PCAValues)
eigenVectors <- as.data.frame(PCAData[["rotation"]])
biplot(PCAValues,scale=0)
#The biplot reveals that retail sales, housing starts, the purchasing managers' index (PMI), business outlook and hourly earnings strongly influence PC1. This suggests that PC1 captures business conditions, since the first four predictors represent business sales volume and higher hourly-earnings growth increases operating costs for businesses.
#The second principal component is probably related to labour market conditions, where higher factory capacity utilization means lower unemployment.
summary(PCAValues)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8095 1.5214 1.2967 1.00608 0.9119 0.77545 0.72349
## Proportion of Variance 0.2977 0.2104 0.1529 0.09202 0.0756 0.05467 0.04759
## Cumulative Proportion 0.2977 0.5081 0.6609 0.75296 0.8286 0.88322 0.93081
## PC8 PC9 PC10 PC11
## Standard deviation 0.57147 0.46119 0.35596 0.30838
## Proportion of Variance 0.02969 0.01934 0.01152 0.00865
## Cumulative Proportion 0.96050 0.97984 0.99135 1.00000
screeplot(PCAValues)
The scree plot and a more lenient application of the Kaiser rule suggest that the first 5 principal components explain most of the variance. Therefore, we perform a linear regression of real GDP on the first 5 principal components.
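The Kaiser rule can be checked directly from the component standard deviations computed above (reusing PCAValues):

```r
#Eigenvalues are the squared standard deviations of the principal components
eigenvalues <- PCAValues$sdev^2
sum(eigenvalues >= 1)    #strict Kaiser rule retains 4 components
sum(eigenvalues >= 0.8)  #a more lenient cutoff retains 5, matching the scree plot
```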
#Transform our initial economic indicators into z-scores as we used standardized PCA
zpredictors <- as.data.frame(scale(x = predictors,center = TRUE,scale = TRUE))
#Matrix multiplication with eigenvectors to obtain a time series of the principal components
TimeSeries <- as.matrix(zpredictors) %*% as.matrix(eigenVectors)
predictorsTimeSeries <- as.data.frame(TimeSeries)
#Read in and store the dependent variable, i.e. GDP
GDPData <- read.csv("GDP.csv",header = TRUE,stringsAsFactors = FALSE)
GDPData[ ,c('Forecast', 'Time')] <- list(NULL)
GDPData$Actual <- as.numeric(gsub(pattern = '%',replacement = '',x = GDPData$Actual))/100
GDPData$Release.Date <- NULL
GDPData <- scale(GDPData,center = TRUE,scale = TRUE)
fullTimeSeries <- cbind(GDPData,predictorsTimeSeries)
model <- lm(Actual~PC1 + PC2 + PC3 + PC4 + PC5,fullTimeSeries)
summary(model)
##
## Call:
## lm(formula = Actual ~ PC1 + PC2 + PC3 + PC4 + PC5, data = fullTimeSeries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.34472 -0.22318 0.08653 0.65270 1.07565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.351e-17 2.174e-01 0.000 1.0000
## PC1 -2.631e-01 1.232e-01 -2.135 0.0509 .
## PC2 5.477e-02 1.466e-01 0.374 0.7143
## PC3 -7.227e-02 1.720e-01 -0.420 0.6807
## PC4 -1.201e-01 2.217e-01 -0.542 0.5965
## PC5 -2.370e-01 2.445e-01 -0.969 0.3488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9721 on 14 degrees of freedom
## Multiple R-squared: 0.3037, Adjusted R-squared: 0.05507
## F-statistic: 1.221 on 5 and 14 DF, p-value: 0.3498
#Unfortunately, the model with the first 5 principal components as predictors is unacceptable, with an adjusted R-squared of 0.05507 and a p-value of 0.3498.
#We next try a stepwise regression on the principal components to see if we can achieve a more accurate model.
stepWisePCA <- step(model)
## Start: AIC=3.73
## Actual ~ PC1 + PC2 + PC3 + PC4 + PC5
##
## Df Sum of Sq RSS AIC
## - PC2 1 0.1319 13.361 1.9320
## - PC3 1 0.1669 13.396 1.9842
## - PC4 1 0.2774 13.506 2.1485
## - PC5 1 0.8879 14.117 3.0327
## <none> 13.229 3.7335
## - PC1 1 4.3070 17.536 7.3704
##
## Step: AIC=1.93
## Actual ~ PC1 + PC3 + PC4 + PC5
##
## Df Sum of Sq RSS AIC
## - PC3 1 0.1669 13.528 0.1802
## - PC4 1 0.2774 13.638 0.3429
## - PC5 1 0.8879 14.249 1.2187
## <none> 13.361 1.9320
## - PC1 1 4.3070 17.668 5.5203
##
## Step: AIC=0.18
## Actual ~ PC1 + PC4 + PC5
##
## Df Sum of Sq RSS AIC
## - PC4 1 0.2774 13.805 -1.4139
## - PC5 1 0.8879 14.416 -0.5484
## <none> 13.528 0.1802
## - PC1 1 4.3070 17.835 3.7084
##
## Step: AIC=-1.41
## Actual ~ PC1 + PC5
##
## Df Sum of Sq RSS AIC
## - PC5 1 0.8879 14.693 -2.1673
## <none> 13.805 -1.4139
## - PC1 1 4.3070 18.112 2.0170
##
## Step: AIC=-2.17
## Actual ~ PC1
##
## Df Sum of Sq RSS AIC
## <none> 14.693 -2.16725
## - PC1 1 4.307 19.000 0.97413
summary(stepWisePCA)
##
## Call:
## lm(formula = Actual ~ PC1, data = fullTimeSeries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.39735 -0.36408 -0.01338 0.56518 1.20980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.693e-17 2.020e-01 0.000 1.0000
## PC1 -2.631e-01 1.145e-01 -2.297 0.0338 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9035 on 18 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.1837
## F-statistic: 5.276 on 1 and 18 DF, p-value: 0.03383
#The stepwise procedure retains only PC1, yielding an adjusted R-squared of 0.1837 and a significant p-value of 0.03383.
Unfortunately, even the best stepwise model keeps only the first principal component and explains little of the variation in GDP. As such, we decided to drop the nowcasting exercise and proceed with other models instead.
In the following section, we employ text analytics techniques to investigate market sentiment and attempt to crystallize any relationship between sentiment and immediate changes in the S&P 500 index value.
Tweets from 300 Twitter accounts that publish content relevant to the financial market are gathered in preparation for the analysis. The list of accounts was compiled through online research, and it includes major news outlets such as Reuters, institutions such as the Federal Reserve, and many financial experts, amongst others. The full list is stored in "TwitterAccount.csv".
To start off with, 3,000 tweets from each of the accounts are obtained, and eventually stored in a file named “TwitterFullData.csv” for further processing.
library(twitteR)
library(ROAuth)
#Setting up the Twitter developer account
consumer_key <- "<CONSUMER_KEY>"          #credentials redacted
consumer_secret <- "<CONSUMER_SECRET>"
token <- "<ACCESS_TOKEN>"
token_Secret <- "<ACCESS_TOKEN_SECRET>"
setup_twitter_oauth(consumer_key, consumer_secret, token, token_Secret)
#Extracting raw tweets from identified accounts
KeyAcct <- read.csv("TwitterAccount.csv")
Tweets <- userTimeline(KeyAcct[1,1], n = 3000, includeRts = FALSE)
Tweet_Full <- twListToDF(Tweets)
for(i in 2:nrow(KeyAcct)){
if(getUser(KeyAcct[i,1])$protected==FALSE){
KeyAcct_Tweet <- userTimeline(KeyAcct[i,1], n = 3000, includeRts = FALSE)
Tweet_Full <- rbind(Tweet_Full, twListToDF(KeyAcct_Tweet))
}
}
write.csv(Tweet_Full, file="TwitterFullData.csv")
Next, we clean the crawled tweets by first retaining only those posted between Jan 1st 2014 and Dec 31st 2018, our intended study period. The retained tweets are further filtered by comparing their content with a list of 238 financial terms. The full list of terms is stored in "EffectiveWords.csv", and it contains terms such as "yield", "risk", and "financ". Tweets that contain none of the 238 terms are discarded, since they are unlikely to carry information useful for our investigation of market sentiment. After the filtering, the final set of tweets is stored in "tweet_useful.csv".
Tweet_Full <- read.csv("TwitterFullData.csv")
Tweet_Full <- Tweet_Full[,c("text","created")]
#Filter tweets within time period 01/01/2014 - 31/12/2018
Tweet_Full$created <- substr(Tweet_Full$created, 1, 10)
Tweet_Full$created <- as.Date(Tweet_Full$created, format = "%Y-%m-%d")
Tweet_Full <- subset(Tweet_Full, Tweet_Full$created >= as.Date("2014-01-01"))
Tweet_Full <- subset(Tweet_Full, Tweet_Full$created <= as.Date("2018-12-31"))
#Read in the effective words library to filter out irrelevant tweets
effective_words <- read.csv("EffectiveWords.csv", stringsAsFactors=FALSE, header = T)
#Create an empty data frame (same columns as Tweet_Full) to store the useful tweets and their creation time
Tweet_useful <- Tweet_Full[0,]
#Go through tweet by tweet to identify the useful tweets that contain any key word in the library
for(i in 1:nrow(Tweet_Full)){
for(j in 1:nrow(effective_words)){
if(length(grep(effective_words[j,1], Tweet_Full[i,1])) > 0){
Tweet_useful <- rbind(Tweet_useful, Tweet_Full[i,])
break
}
}
}
write.csv(Tweet_useful, file="tweet_useful.csv")
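The nested loop above scales with tweets × terms, which is slow for roughly 60,000 tweets. As a sketch, the same filter can be vectorized with a single alternation pattern (assuming the terms are plain substrings, as in the loop; Tweet_useful_fast is an illustrative name):

```r
#Combine all terms into one regular-expression alternation and filter in one pass
pattern <- paste(effective_words[[1]], collapse = "|")
keep <- grepl(pattern, Tweet_Full$text)
Tweet_useful_fast <- Tweet_Full[keep, ]
```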
Some tweet snippets are shown below:
Tweet_useful <- read.csv("tweet_useful.csv")
head(Tweet_useful$text, 10)
## [1] @joetrader6 @CNBC @WSJ @business @bpolitics @hblodget There are no such restrictions on elected officials
## [2] @jessefelder Jesse, how does this compare to U.S. and E.U.?
## [3] @d0ubleorn0thing Could be completed, lead to brief FOMO, then roll over
## [4] @imarvindmahesh I do not have a trade on. It is an appraisal of the chart. An opinion is not a position.
## [5] Price action in #Cable could qualify as an "end-around," implying an advance back to 1.3800 $gbpusd https://t.co/shbkOdO7hk
## [6] Many chartists view $BTC as a H&S bottom. I am NOT among them. Higher probability is that bear market is not over. https://t.co/64BmxmNWyy
## [7] When back testing a trading system make sure to use out-of-sample data in addition to in-sample data to confirm a t… https://t.co/snOb9JULYw
## [8] @HoustonMarck @SJosephBurns @traderstewie @TraderMentality @option_snipper @allstarcharts @AOTtrades Trading does n… https://t.co/2MpZVVprvX
## [9] @Grizzlyshort Lower risk during losing streaks, ramp back up when wins occur. I use a grid system to determine sizing/risk per trade
## [10] @Secret_Profits A trading approach can be out of synch with markets -- this is normal the purpose of aggressive ris… https://t.co/dygLv7fqRL
## 60053 Levels: _________\n| Frankly, what |\n| this newsletter |\n| has to say is a |\n| very small part |\n| of the pressure… https://t.co/9PklNy7jDI ...
For an initial quick analysis of the text, we produce a word cloud displaying the words that appear most frequently in our collected tweets. With the set of useful tweets, we first sort the tweets in chronological order by creation time, and then perform the necessary text processing.
Tweet_useful_sorted <- Tweet_useful[order(as.Date(Tweet_useful$created, format="%Y-%m-%d")),]
library(tm)
## Loading required package: NLP
Tweet_Text_Corpus <- Corpus(VectorSource(Tweet_useful_sorted$text))
#Text cleaning
RemoveURL <- function(x) gsub("(http://t.co)(.*)", "", x)
RemoveURLs <- function(x) gsub("(https://t.co)(.*)", "", x)
RemoveName <- function(x) gsub("@\\w+", "", x)
RemoveAmp <- function(x) gsub("amp;", "", x)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveURL))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveURL)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveURLs))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveURLs)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveName))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveName)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(RemoveAmp))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(RemoveAmp)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, stripWhitespace):
## transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus,
## content_transformer(tolower)): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, removeWords,
## stopwords("english")): transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, removeNumbers):
## transformation drops documents
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, removePunctuation):
## transformation drops documents
#Text stemming (note: passing SnowballC::wordStem directly stems single words, so
#multi-word documents are left largely unstemmed here; tm's stemDocument wrapper
#would stem each word, at the cost of changing the frequency counts shown below)
library(SnowballC)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, wordStem, language = "english")
## Warning in tm_map.SimpleCorpus(Tweet_Text_Corpus, wordStem, language =
## "english"): transformation drops documents
#Identify the most frequently appeared words
Tweet_tdm <- TermDocumentMatrix(Tweet_Text_Corpus)
Tweet_RemoveSparse_tdm <- removeSparseTerms(Tweet_tdm, sparse = 0.999)
freq <- sort(rowSums(as.matrix(Tweet_RemoveSparse_tdm)), decreasing=TRUE)
head(freq, 50)
## … market just like will year can trading now
## 5077 3179 2684 2508 2491 2397 2182 2140 2131
## one time new good today think last day week
## 2105 2084 1988 1939 1740 1722 1707 1602 1572
## since get see years stock short high low spx
## 1533 1496 1488 1481 1479 1406 1392 1351 1338
## trade people stocks ’s great long still know much
## 1251 1247 1237 1203 1201 1163 1140 1131 1099
## back next icymi big markets right also calls price
## 1090 1031 988 973 962 961 957 951 950
## first vix make call many
## 943 936 934 929 919
Looking at the most frequent words above gives us an idea of which terms require fixing or replacement.
#Fix or remove words of undesirable form
ReplaceWord <- function(x) gsub("\\bvol\\b", "volume", x)  #word boundaries avoid mangling e.g. "volatility"
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("…", "", x)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("’s", "", x)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("t…", "", x)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
ReplaceWord <- function(x) gsub("“", "", x)
Tweet_Text_Corpus <- tm_map(Tweet_Text_Corpus, content_transformer(ReplaceWord))
more.stop.words <- c("just", "time", "week", "weeks", "like", "day", "days", "years", "stocks", "will", "one", "can", "now", "look", "month", "months", "say", "way", "icymi", "next", "things", "thing", "markets", "also", "vix", "make", "many", "come", "two", "may", "yes", "name", "really", "actually", "yet", "even", "another", "always", "seem", "every", "got", "looks", "ago", "getting", "probably", "part", "might", "pretty", "give", "idea", "twitter", "tweet", "live", "said", "guy", "something", "likely", "seems", "says", "said", "don’t", "ever", "maybe", "though", "almost", "times", "given", "morning", "seen", "’re", "someone", "everyone", "nothing", "yeah", "anyone", "todays", "per", "already", "saying", "aapl", "thank", "trying")
Tweet_Review_Corpus <- tm_map(Tweet_Text_Corpus, removeWords, more.stop.words)
#Display the word cloud to view the most frequently appeared words from the tweets
library(wordcloud)
## Loading required package: RColorBrewer
##
## Attaching package: 'wordcloud'
## The following object is masked from 'package:PerformanceAnalytics':
##
## textplot
wordcloud(Tweet_Review_Corpus, min.freq = 450, max.words = Inf,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))
The word cloud attests to the content validity of the gathered tweets, as the most frequent words are in general either finance-specific (such as fund, spx, and stock) or finance-related (such as trading, growth, and price). We are thus assured that further analysis of the tweet text will produce relevant and useful results.
After processing the tweets, we are going to perform sentiment analysis on the relevant tweets. To conduct the analysis, we will be using the Syuzhet package, which comes with four sentiment dictionaries (“syuzhet”, “bing”, “afinn”, and “nrc”).
By using the get_sentiment function, we can obtain a sentiment score for each tweet. One thing to note is that different dictionaries use different scales.
library(syuzhet)
Tweet_Review_Data <- sapply(Tweet_Review_Corpus, as.character)
#Get sentiment values using four different dictionaries
syuzhet_sentiment <- get_sentiment(Tweet_Review_Data, method = "syuzhet", language = "english")
bing_sentiment <- get_sentiment(Tweet_Review_Data, method = "bing", language = "english")
afinn_sentiment <- get_sentiment(Tweet_Review_Data, method = "afinn", language = "english")
nrc_sentiment <- get_sentiment(Tweet_Review_Data, method = "nrc", language = "english")
We combine the scores from the different dictionaries into one data frame, sentiment_values, and compare the pairwise correlation coefficients among the dictionary methods. From the result, the syuzhet dictionary has the highest correlations with the other dictionaries, while the correlations between nrc and bing and between nrc and afinn are relatively low.
#Combine different scores into a data frame, and compare pairwise correlation coefficients among different dictionary methods
sentiment_values <- data.frame(syuzhet=syuzhet_sentiment, bing=bing_sentiment, afinn=afinn_sentiment, nrc=nrc_sentiment)
cor(sentiment_values)
## syuzhet bing afinn nrc
## syuzhet 1.0000000 0.7216827 0.6903409 0.6792909
## bing 0.7216827 1.0000000 0.6625483 0.4687012
## afinn 0.6903409 0.6625483 1.0000000 0.4501116
## nrc 0.6792909 0.4687012 0.4501116 1.0000000
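Correlation is scale-invariant, but the raw scores are not directly comparable in magnitude because each dictionary uses its own scale (afinn, for instance, rates words from -5 to +5). As a sketch, standardizing each column would put the four dictionaries on a common footing before any cross-dictionary averaging:

```r
#Z-score each dictionary's per-tweet sentiment so magnitudes are comparable
sentiment_scaled <- as.data.frame(scale(sentiment_values))
round(sapply(sentiment_scaled, sd), 2)  #each column now has unit standard deviation
```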
We also add in the date when the tweet was posted. Moreover, week number and month number for each tweet are added in. As the time horizon of our analysis is from 2014 to 2018, we will set the week number as 1 for all those tweets that were posted in the first week of 2014. Likewise, we will set the month number as 1 for tweets that were posted in January 2014.
We calculate the monthly mean sentiment values under each dictionary and plot them over time in the graphs below. The first plot overlays the sentiment trends from all four dictionaries for easier comparison.
From the plots, we can see that they generally reflect market sentiment. For example, between Nov 2015 and Feb 2016 (months 23 to 26), the sentiment values are at the low end, which can plausibly be explained by the Chinese stock market turbulence that hurt overall market confidence. Sentiment values in 2018 are at the relatively high end because, during this period, the S&P 500 index kept making record highs.
#Add in date as well as week number and month number (starting reference point 29/12/2013)
sentiment_values$date <- as.Date(Tweet_useful_sorted$created,"%d/%m/%Y")
sentiment_values$week <- as.numeric(floor(difftime(sentiment_values$date, as.Date("29/12/2013","%d/%m/%Y"), units = "weeks")))+1
sentiment_values$week <- as.factor(sentiment_values$week)
sentiment_values$month <- (as.numeric(substr(sentiment_values$date,1,4))-2014)*12+as.numeric(substr(sentiment_values$date,6,7))
sentiment_values$month <- as.factor(sentiment_values$month)
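As a quick sanity check of the numbering scheme (an illustrative date, not taken from the data):

```r
#2014-01-05 lies in the second week after the 29/12/2013 reference point,
#and in the first month of the study period
d <- as.Date("2014-01-05")
floor(as.numeric(difftime(d, as.Date("29/12/2013", "%d/%m/%Y"), units = "weeks"))) + 1  #2
(as.numeric(substr(d, 1, 4)) - 2014) * 12 + as.numeric(substr(d, 6, 7))                 #1
```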
#Time series plot: Monthly Sentiment Values over Time
sentiment_time_series_byMonth <-
data.frame(month=1:60, syuzhet=tapply(sentiment_values$syuzhet, sentiment_values$month, mean),
bing=tapply(sentiment_values$bing, sentiment_values$month, mean),
afinn=tapply(sentiment_values$afinn, sentiment_values$month, mean),
nrc=tapply(sentiment_values$nrc, sentiment_values$month, mean))
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
ggplot(data = sentiment_time_series_byMonth) + geom_line(aes(x = month, y = syuzhet), color = "red") + geom_line(aes(x = month, y = bing), color = "blue") + geom_line(aes(x = month, y = afinn), color = "green") + geom_line(aes(x = month, y = nrc), color = "black") + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time") + theme(legend.position = "none")
ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = syuzhet)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (syuzhet method)")
ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = bing)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (bing method)")
ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = afinn)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (afinn method)")
ggplot(data = sentiment_time_series_byMonth, aes(x = month, y = nrc)) + geom_line(size = 1) + labs(x = "Month", y = "Sentiment Values", title = "Monthly Sentiment Values over Time (nrc method)")
We would like to use the weekly sentiment value to predict whether the S&P 500 moved up or down during that week. If the sentiment value is above a certain threshold, we classify the average sentiment as positive and predict that the S&P 500 moved up; conversely, if it is below the threshold, we classify the average sentiment as negative and predict that the S&P 500 moved down.
A function findThreshold is developed to find the optimal classification threshold for each dictionary method, where the prediction accuracy produced by confusion matrix is the highest.
weekly_return <- read.csv("SPXWeeklyReturn.csv")
sentiment_time_series_byWeek <-
data.frame(week=1:262, updown = ifelse(weekly_return$Return>0,"Up","Down"),
return=weekly_return$Return,
syuzhet=tapply(sentiment_values$syuzhet, sentiment_values$week, mean),
bing=tapply(sentiment_values$bing, sentiment_values$week, mean),
afinn=tapply(sentiment_values$afinn, sentiment_values$week, mean),
nrc=tapply(sentiment_values$nrc, sentiment_values$week, mean))
#function to find the best threshold with highest accuracy
findThreshold <- function(method){
best_threshold <- 0
best_accuracy <- 0
#use the per-tweet scores of the chosen dictionary to set the range of candidate thresholds
if (method=="syuzhet"){
data <- syuzhet_sentiment
} else if (method=="bing"){
data <- bing_sentiment
} else if (method=="afinn"){
data <- afinn_sentiment
} else if (method=="nrc"){
data <- nrc_sentiment
} else {
return(NULL)
}
for (i in seq(min(data),max(data),0.01)){
#confusion matrix of actual up/down weeks against the thresholded prediction
tab <- table(sentiment_time_series_byWeek$updown == "Up", sentiment_time_series_byWeek[,method]>i)
#skip thresholds that classify every week the same way
if (dim(tab)[2]==1)
next
accuracy <- (tab[1,1]+tab[2,2])/sum(tab)
if (accuracy>best_accuracy){
best_threshold <- i
best_accuracy <- accuracy
}
}
return(c(threshold=best_threshold,accuracy=best_accuracy))
}
syuzhet_threshold <- findThreshold("syuzhet")
bing_threshold <- findThreshold("bing")
afinn_threshold <- findThreshold("afinn")
nrc_threshold <- findThreshold("nrc")
syuzhet_threshold
## threshold accuracy
## -0.1700000 0.5801527
bing_threshold
## threshold accuracy
## -0.3500000 0.5954198
afinn_threshold
## threshold accuracy
## -0.6900000 0.5839695
nrc_threshold
## threshold accuracy
## -0.2700000 0.5839695
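To illustrate how such a tuned threshold would be applied out of sample, a minimal sketch is given below. The sentiment values and the helper function classify_week are hypothetical; -0.35 is the bing threshold found above.

```r
# Minimal sketch (hypothetical helper): classify one week's average
# sentiment as an Up/Down prediction using a tuned threshold.
classify_week <- function(sentiment, threshold) {
  ifelse(sentiment > threshold, "Up", "Down")
}

classify_week(-0.10, threshold = -0.35)  # above threshold -> "Up"
classify_week(-0.50, threshold = -0.35)  # below threshold -> "Down"
```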
From the results, we can see that the prediction accuracy of the four methods is roughly the same, at around 58%, which is not very high. Possible explanations include the limited time horizon of the data collection and the lack of a sentiment dictionary tailored to financial jargon, which are discussed further in the conclusion.
Nevertheless, despite this result, our text analytics approach does provide hedge funds and other asset management companies with a general methodology for leveraging Twitter sentiment information to aid investment decisions.
Regression is carried out using weekly spx prices from Jan 2014 to Dec 2018 as the dependent variable. The explanatory variables are the following:
1. Market & S&P Portfolio Indices: S&P 500 portfolio price to earnings ratio (p/e), the volatility index (vix), put to call ratio (put/call)
2. Technical Analysis Indicators: Z-score (Zscore), Relative Strength Index (rsi), volume weighted average price (vwap)
3. Fama-French 3 Factor Model Variables: excess market return (excess.mkt.return), Small Minus Big (smb), High Minus Low (hml), risk-free rate (rf)
4. Macroeconomic Indicators: gross domestic product of the US (gdp), inflation rate, unemployment rate
Data cleaning is first carried out to prepare for regression later.
data<-read.csv("RegressionData.csv")
colnames(data) <-c("date","spx","p/e","vix","put/call","Zscore","rsi","vwap","excess.mkt.return","smb","hml","rf","gdp","inflation","unemployment")
data<-na.omit(data)
#format first column to proper date format
data$date<-as.Date(data$date,origin="1899-12-30")
data$gdp<-as.numeric(data$gdp)
data$inflation<-as.numeric(data$inflation)
data$unemployment<-as.numeric(data$unemployment)
summary(data)#check data type
## date spx p/e vix
## Min. :2014-02-07 Min. :1797 Min. :15.83 Min. : 9.14
## 1st Qu.:2015-04-29 1st Qu.:2033 1st Qu.:18.10 1st Qu.:11.97
## Median :2016-07-18 Median :2127 Median :19.30 Median :13.53
## Mean :2016-07-18 Mean :2264 Mean :19.43 Mean :14.77
## 3rd Qu.:2017-10-07 3rd Qu.:2507 3rd Qu.:20.81 3rd Qu.:16.27
## Max. :2018-12-28 Max. :2930 Max. :23.39 Max. :30.11
## put/call Zscore rsi vwap
## Min. :1.036 Min. :-3.1940 Min. :28.15 Min. :1787
## 1st Qu.:1.608 1st Qu.: 0.1802 1st Qu.:42.51 1st Qu.:2002
## Median :1.900 Median : 1.2391 Median :51.57 Median :2098
## Mean :1.894 Mean : 0.9906 Mean :51.65 Mean :2229
## 3rd Qu.:2.134 3rd Qu.: 2.0707 3rd Qu.:59.24 3rd Qu.:2461
## Max. :3.417 Max. : 4.2366 Max. :82.45 Max. :2826
## excess.mkt.return smb hml rf
## Min. :-7.3300 Min. :-3.1700 Min. :-3.8400 Min. :0.00000
## 1st Qu.:-0.6950 1st Qu.:-0.7450 1st Qu.:-0.8200 1st Qu.:0.00000
## Median : 0.3150 Median : 0.0150 Median :-0.1550 Median :0.00500
## Mean : 0.1669 Mean :-0.0491 Mean :-0.0459 Mean :0.01196
## 3rd Qu.: 1.2825 3rd Qu.: 0.6225 3rd Qu.: 0.6025 3rd Qu.:0.02100
## Max. : 4.7800 Max. : 6.0700 Max. : 3.9100 Max. :0.04800
## gdp inflation unemployment
## Min. :-0.007330 Min. :-0.0002501 Min. :0.03700
## 1st Qu.: 0.003358 1st Qu.: 0.0002499 1st Qu.:0.04275
## Median : 0.005701 Median : 0.0002499 Median :0.04900
## Mean : 0.005550 Mean : 0.0003133 Mean :0.04939
## 3rd Qu.: 0.008211 3rd Qu.: 0.0004996 3rd Qu.:0.05500
## Max. : 0.012272 Max. : 0.0007492 Max. :0.06700
A simple plot is produced to get a rough idea of what the dependent variable, the S&P 500 index value (denoted spx), looks like.
library(ggplot2)
ggplot(data, aes(x=date, y=spx)) + geom_line()
Further data cleaning is carried out to remove variables that exhibit significant correlation with other variables, in order to reduce inaccuracies arising from multicollinearity.
library(PerformanceAnalytics)
data2<-subset(data[,-1]) #remove the first column "date", leave only numeric values
#summary(data2)
chart.Correlation(data2,histogram=TRUE,pch=19) #study the correlation between column variables
#vwap and zscore show significant correlation with a few other variables; they are removed from the model
data2<-data2[,-c(5,7)]#remove columns with high correlation
data4 <- data2
A simple linear model is developed first. The first 80% of the data is used as the training set and the remaining 20% as the test set.
train <- floor(0.8*nrow(data2))
data2.x <- model.matrix(spx~., data = data2)
data2.x <- data2.x[1:train,-1]
data2.y <- data2$spx[1:train]
The LOOCV (leave one out cross validation) method is applied to the dataset.
library(broom)
library(tidyverse)
library(HDtweedie)
library(glmnet)
library(MASS)
library(car)
library(surveillance)
Loocv.lasso.data2 <- cv.glmnet(x = data2.x, y = data2.y, alpha = 1, nfolds = nrow(data2.x))
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
plot(Loocv.lasso.data2)
This graph shows the cross-validated MSE for different lambda values. For parsimony, we will use lambda.1se, the largest lambda whose CV error is within one standard error of that of the best model.
glance(Loocv.lasso.data2)
## # A tibble: 1 x 2
## lambda.min lambda.1se
## <dbl> <dbl>
## 1 0.253 1.96
We now run the model, generate projected spx values, and build the confusion matrix to find the accuracy rate of the linear model.
lambda1se.data2 <- Loocv.lasso.data2$lambda.1se
lasso.lambda1se <- glmnet(x = data2.x, y = data2.y, alpha = 1, lambda = lambda1se.data2)
xset.lambda1se <- tidy(lasso.lambda1se)$term
xset.lambda1se <- xset.lambda1se[-1]
#assemble the selected terms into a model formula
f1 <- as.formula(paste("spx ~", paste(xset.lambda1se, collapse = " + ")))
model1.1se <- glm(f1, data = data2)
pchisq(glance(model1.1se)$deviance, glance(model1.1se)$df.residual,lower.tail = F)
## [1] 0
The chi-squared test generates a p-value approximately equal to 0, rejecting the null hypothesis that the model has no explanatory power. This indicates that the model is able to explain the movement in spx.
The model summary for the linear model is given below:
summary(model1.1se)
##
## Call:
## glm(formula = f1, data = data2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -140.151 -34.612 -6.706 36.865 130.998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1202.599 162.039 7.422 1.90e-12 ***
## `p/e` 52.643 4.728 11.135 < 2e-16 ***
## vix -6.666 1.423 -4.684 4.67e-06 ***
## `put/call` 40.087 10.647 3.765 0.000209 ***
## excess.mkt.return 4.338 2.591 1.674 0.095375 .
## smb -3.599 3.165 -1.137 0.256635
## hml -5.543 3.019 -1.836 0.067605 .
## rf 15539.121 529.724 29.334 < 2e-16 ***
## gdp 5302.923 1062.736 4.990 1.15e-06 ***
## inflation 59867.855 19601.733 3.054 0.002506 **
## unemployment -3520.919 1221.542 -2.882 0.004298 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3052.544)
##
## Null deviance: 24459856 on 255 degrees of freedom
## Residual deviance: 747873 on 245 degrees of freedom
## AIC: 2793.3
##
## Number of Fisher Scoring iterations: 2
Plotting the predicted spx values against the residuals (actual minus predicted spx values) for the test set.
Model1withRes <- augment(model1.1se)
pred.data2 <- predict(model1.1se, data2[(train+1):nrow(data2),])
res.data2 <- data2$spx[(train+1):nrow(data2)]-pred.data2
plot(pred.data2, res.data2, type="p")
abline(h=0, col="red")
The plot shows that the residuals appear randomly scattered and uncorrelated with the predicted values.
Plotting the standardized residuals
plot(x = 1:nrow(data2), y = Model1withRes$.std.resid,
xlab = "observation ID", ylab = "std residuals",
ylim = c(-5,5),
main = "Standardized Residuals")
abline(h = c(-2,2), lty = 3, col = "blue")
The plot shows that the vast majority of the points are located within 2 standard deviations from 0, validating that the model does not show significant abnormality.
Plotting the Anscombe residuals
#Anscombe residuals
qqnorm(surveillance::anscombe.residuals(model1.1se, 1),
xlim = c(-3,3), ylim = c(-3,3), main = "Anscombe Residuals")
abline(a = 0, b = 1, lty = 2, col = "red")
The quantile-quantile plot shows that the points largely fall very close to the 45-degree line. This shows that the residuals follow a normal distribution, supporting that the model is a good fit.
Examination of Model Accuracy
Generating the confusion matrix and calculating the accuracy rate of prediction. The predicted spx values are compared against the previous week:
- an increase of more than 100 basis points (1%) is defined as a bull week, denoted 1
- a decrease of more than 100 basis points (1%) is defined as a bear week, denoted -1
- otherwise it is a base week, denoted 0
pred.model1 <- predict(model1.1se, data2)
n <- length(pred.model1)
classification <- data.frame(change = pred.model1[2:n]/pred.model1[1:(n-1)] - 1)
classification$actualchange <- data2$spx[2:n]/data2$spx[1:(n-1)] - 1
#classify week-on-week changes: bull (1), bear (-1), base (0)
classification$result <- ifelse(classification$change > 0.01, 1,
                         ifelse(classification$change < -0.01, -1, 0))
classification$actual <- ifelse(classification$actualchange > 0.01, 1,
                         ifelse(classification$actualchange < -0.01, -1, 0))
confusion1<-table(classification$actual,classification$result)#plot confusion matrix
accuracy1<- (confusion1[1,1]+confusion1[2,2]+confusion1[3,3])/sum(confusion1) #calculate accuracy rate for model1
confusion1
##
## -1 0 1
## -1 41 29 3
## 0 27 82 23
## 1 1 14 35
cat("accuracy for model1 is ",accuracy1,".")
## accuracy for model1 is 0.6196078 .
We can see that the model produces a decent accuracy rate of approximately 62%.
To investigate whether this linear model could be improved further, interaction terms of the explanatory variables are also included in the analysis below.
data4.x <- model.matrix(spx~.^2, data = data4)
data4.x <- data4.x[1:train,-1]
data4.y <- data4$spx[1:train]
Loocv.lasso.data4 <- cv.glmnet(x = data4.x, y = data4.y, alpha = 1, nfolds = nrow(data4.x))
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
plot(Loocv.lasso.data4)
glance(Loocv.lasso.data4)
## # A tibble: 1 x 2
## lambda.min lambda.1se
## <dbl> <dbl>
## 1 0.137 0.217
To allow a fair comparison with the first linear model, the lambda value with the CV error within 1 standard error of the best model is again selected, as in model 1.
lambda1se.data4 <- Loocv.lasso.data4$lambda.1se
lasso.lambda1se4 <- glmnet(x = data4.x, y = data4.y, alpha = 1, lambda = lambda1se.data4)
xset.lambda1se4 <- tidy(lasso.lambda1se4)$term
xset.lambda1se4 <- xset.lambda1se4[-1] #drop the intercept term
#assemble the selected terms into a model formula
f3 <- as.formula(paste("spx ~", paste(xset.lambda1se4, collapse = " + ")))
model4.1se <- glm(f3, data = data2)
pchisq(glance(model4.1se)$deviance, glance(model4.1se)$df.residual,lower.tail = F)
## [1] 0
summary(model4.1se)
##
## Call:
## glm(formula = f3, data = data2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -203.367 -37.614 -6.614 33.752 250.385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.924e+03 5.956e+01 49.090 < 2e-16 ***
## vix -1.845e+01 1.165e+00 -15.844 < 2e-16 ***
## `put/call` 4.468e+01 1.303e+01 3.429 0.000709 ***
## excess.mkt.return -1.871e-01 3.134e+00 -0.060 0.952437
## smb -1.291e+00 3.868e+00 -0.334 0.738902
## hml -5.530e+00 3.698e+00 -1.496 0.136035
## rf 1.243e+04 5.516e+02 22.542 < 2e-16 ***
## gdp 7.684e+03 1.275e+03 6.027 6.06e-09 ***
## inflation 1.032e+05 2.353e+04 4.385 1.72e-05 ***
## unemployment -1.409e+04 9.424e+02 -14.947 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4578.749)
##
## Null deviance: 24459856 on 255 degrees of freedom
## Residual deviance: 1126372 on 246 degrees of freedom
## AIC: 2896.2
##
## Number of Fisher Scoring iterations: 2
The p-value for the chi-squared test is also approximately 0, rejecting the null hypothesis that this model has no explanatory power.
The various plots below also show that the residuals do not exhibit any significant irregularity. However, the graphs appear less ideal than those from model 1. For instance, the Anscombe residual plot shows some deviation from the 45-degree line at low and high quantiles, indicating non-normal behaviour of the residuals.
Model1withRes4 <- augment(model4.1se)
pred.data4 <- predict(model4.1se, data2[(train+1):nrow(data2),])
res.data4 <- data2$spx[(train+1):nrow(data2)]-pred.data4
plot(pred.data4, res.data4, type="p")
abline(h=0, col="red")
#Standardized residuals
plot(x = 1:nrow(data2), y = Model1withRes4$.std.resid,
xlab = "observation ID", ylab = "std residuals",
ylim = c(-5,5),
main = "Standardized Residuals")
abline(h = c(-2,2), lty = 3, col = "blue")
#Anscombe residuals
qqnorm(surveillance::anscombe.residuals(model4.1se, 1),
xlim = c(-3,3), ylim = c(-3,3), main = "Anscombe Residuals")
abline(a = 0, b = 1, lty = 2, col = "red")
Examination of Model Accuracy
pred.model4 <- predict(model4.1se, data2)
n <- length(pred.model4)
classification$model4 <- pred.model4[2:n]/pred.model4[1:(n-1)] - 1
classification$result4 <- ifelse(classification$model4 > 0.01, 1,
                          ifelse(classification$model4 < -0.01, -1, 0))
confusion4 <- table(classification$actual, classification$result4) #plot confusion matrix
accuracy4 <- (confusion4[1,1]+confusion4[2,2]+confusion4[3,3])/sum(confusion4) #calculate accuracy rate for model4
confusion4
##
## -1 0 1
## -1 41 27 5
## 0 36 71 25
## 1 2 15 33
cat("accuracy for model4 is ",accuracy4,".")
## accuracy for model4 is 0.5686275 .
The accuracy, despite the inclusion of interaction terms of the existing explanatory variables, is slightly lower at 57% than that of model 1.
We proceed to investigate whether a logistic model could have better predictive power than the linear models. Preparation for the logistic model includes defining the dependent variable as 3 distinct levels of spx movement: bull, base and bear, similar to the above.
data3<-data2[-1,]
data3$movement<- classification$actual
#loocv variable selection
data3.x <- model.matrix(movement~., data = data3)
data3.x <- data3.x[1:train,-c(1,13)]
data3.y <- data3$movement[1:train]
Loocv.lasso.data3 <- cv.glmnet(x = data3.x, y = data3.y, alpha = 1, nfolds = nrow(data3.x), family="multinomial")
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
plot(Loocv.lasso.data3)
The lambda value that gives the lowest MSE is selected.
glance(Loocv.lasso.data3)
## # A tibble: 1 x 2
## lambda.min lambda.1se
## <dbl> <dbl>
## 1 0.0375 0.138
lambdamin.data3<-Loocv.lasso.data3$lambda.min
model2<-glmnet(data3.x,data3.y,alpha=1, family="multinomial",lambda=lambdamin.data3)
lasso.lambdamin2 <- glmnet(x = data3.x, y = data3.y, alpha = 1, lambda = lambdamin.data3)
xset.lambdamin2 <- tidy(lasso.lambdamin2)$term
xset.lambdamin2 <- xset.lambdamin2[-1]
#assemble the selected terms into a model formula
f2 <- as.formula(paste("movement ~", paste(xset.lambdamin2, collapse = " + ")))
model2.min <- glm(f2, data = as.data.frame(data3[1:train,]))
summary(model2.min)
##
## Call:
## glm(formula = f2, data = as.data.frame(data3[1:train, ]))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.21025 -0.61043 0.03294 0.19798 1.39619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3348761 0.4850832 -0.690 0.4908
## spx 0.0001696 0.0001659 1.023 0.3077
## vix -0.0103469 0.0128159 -0.807 0.4204
## excess.mkt.return 0.0446492 0.0298041 1.498 0.1357
## smb -0.0964857 0.0418738 -2.304 0.0222 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.444156)
##
## Null deviance: 93.583 on 203 degrees of freedom
## Residual deviance: 88.387 on 199 degrees of freedom
## AIC: 420.3
##
## Number of Fisher Scoring iterations: 2
It can be observed that most of the explanatory variables become insignificant in this model. This suggests that logistic regression may not be the best approach to model the relationship between the different levels of spx movement and the explanatory variables. However, for comparison and illustration purposes, we will still carry on with this model and investigate its prediction accuracy.
Examination of Model Accuracy
pred.model2 <- predict(model2.min, data3)
classification2 <- data.frame(actual = data3$movement)
classification2$result <- ifelse(pred.model2 > 0.01, 1,
                          ifelse(pred.model2 < -0.01, -1, 0))
confusion2 <- table(classification2$actual, classification2$result) #plot confusion matrix
accuracy2 <- (confusion2[1,1]+confusion2[2,2]+confusion2[3,3])/sum(confusion2) #calculate accuracy rate for model2
confusion2
##
## -1 0 1
## -1 61 3 9
## 0 84 8 40
## 1 34 2 14
cat("accuracy for model2 is ",accuracy2,".")
## accuracy for model2 is 0.3254902 .
The results show that the accuracy rate, at less than 50%, is significantly lower than that of the linear model. This echoes the observation that the logistic model has relatively weak explanatory power in projecting spx prices.
Overall, it appears that model 1 produces better graphs as well as higher accuracy rate, so it is by far the best model for projecting future spx prices.
A simulation is carried out using model 1, which has the best accuracy among the 3 regression models. To model the input variables, distribution fitting is carried out to find the most suitable distribution for each.
A simple illustration is included below using the variable vix.
library(fitdistrplus)
#vix appears right-skewed from correlation matrix plot at the start
#finding the most suitable distribution for modelling vix
norm <- fitdist(data2$vix, "norm")
gamma <- fitdist(data2$vix, "gamma")
lnorm <- fitdist(data2$vix,"lnorm")
plot.legend <- c("Normal", "Gamma", "Lognormal")
denscomp(list(norm, gamma, lnorm), legendtext = plot.legend)
The density plot seems to suggest that lognormal is the most suitable distribution to model vix. However, a more scientific approach is employed by selecting the distribution that gives the lowest AIC value.
which.min(c(norm$aic, gamma$aic, lnorm$aic))
## [1] 3
The lognormal distribution is finally selected as it yields the minimum AIC value among the 3 candidates.
The simulation model is built to project the spx price over the first 12 weeks of 2019 (as the input data used for modelling ends at the last week of 2018). It is an iterative simulation in which input variables are generated for 1 week by sampling from the fitted distributions. The sampled data are appended to the input data to generate the input values for the following week, until the input variable values for the 12th week are generated. An estimate of the spx price is computed for each week and the week-on-week price change is calculated.
The user is expected to provide the simulation inputs: the number of iterations to run, the simulation period (in weeks), and the cutoff of spx price change to track. For instance, should the user wish to know the probability that the spx price will rise by more than 2% in a week, 0.02 is supplied as the last simulation input.
library(fitdistrplus)
simulate.input<-data2
#default simulation period as 12 weeks (1 quarter)
iteration<-function(iter,periods = 12,cutoff = 0.02){
predicted <- c()
for(i in 1:periods){
#fitdist stores the fitted parameters in $estimate (e.g. meanlog/sdlog or mean/sd);
#$sd holds the standard errors of those estimates, not the distribution's sd
fit.vix <- fitdist(simulate.input$vix, "lnorm")
vix.sample <- rlnorm(1, meanlog = fit.vix$estimate[1], sdlog = fit.vix$estimate[2])
fit.pe <- fitdist(simulate.input$`p/e`, "norm")
pe.sample <- rnorm(1, fit.pe$estimate[1], fit.pe$estimate[2])
fit.put.call <- fitdist(simulate.input$`put/call`, "norm")
put.call.sample <- rnorm(1, fit.put.call$estimate[1], fit.put.call$estimate[2])
fit.excess.mkt <- fitdist(simulate.input$excess.mkt.return, "norm")
excess.mkt.sample <- rnorm(1, fit.excess.mkt$estimate[1], fit.excess.mkt$estimate[2])
fit.smb <- fitdist(simulate.input$smb, "norm")
smb.sample <- rnorm(1, fit.smb$estimate[1], fit.smb$estimate[2])
hml.norm <- fitdist(simulate.input$hml, "norm")
hml.sample <- rnorm(1, hml.norm$estimate[1], hml.norm$estimate[2])
gdp.norm <- fitdist(simulate.input$gdp, "norm")
gdp.sample <- rnorm(1, gdp.norm$estimate[1], gdp.norm$estimate[2])
#unemployment, risk-free rate and inflation are assumed to stay constant over a quarter
unemployment.sample<-0.037
rf.sample<-0.0048
inflation.sample<-0.00025
simulate.input<-rbind(simulate.input,c(1000,pe.sample,vix.sample,put.call.sample,52.536,excess.mkt.sample,smb.sample,hml.sample,rf.sample,gdp.sample,inflation.sample,unemployment.sample))
predicted[i] <- predict(model1.1se,simulate.input[nrow(simulate.input),])
}
change <- predicted[2:periods]/predicted[1:(periods-1)] -1
boom <- ifelse(sum(change > cutoff)>0,1,0)
return(boom)
}
simulate<-function(iter,periods = 12,cutoff = 0.05){
result<-sapply(1:iter,iteration,periods,cutoff)
prob <- sum(result)/iter
cat("The probability of getting a market increase of ",cutoff," is ",prob,".\n")
}
simulate(10,12,0.025) #iterations reduced to speed up knitting
## The probability of getting a market increase of 0.025 is 0.3 .
The results show that the probability that the spx price will see a week-on-week increase of more than 2.5% (250 basis points) is approximately 20% to 30%, a bullish signal for the investor to buy spx now (at the start of 2019). This is consistent with the actual spx price trend in Quarter 1 2019, in which 3 out of 12 weeks indeed showed a week-on-week increase of more than 2.5%.
In conclusion, we have attempted to apply PCA, Text Analytics, Regression and Simulation to forecast the US stock market performance in 2019. The stock market performance is tracked using a proxy, the S&P 500 index. Nowcasting GDP using the PCA approach did not generate satisfactory accuracy, but it remains an interesting experiment to project numerical values using principal component analysis.
Modelling of the stock market performance using text analytics and regression generates similar accuracy rates of approximately 60%. Text analytics helps investors gauge the general sentiment among participants in the financial market, a qualitative measure that is hard to capture with numerical indicators. In comparison, the regression analysis utilises quantitative measures to project the actual level the spx price could reach, and it predicts Q1 2019 to be bullish, which tallies with the actual movement of spx.
We are also aware that both approaches face limitations that adversely affect the applicability and accuracy of the results. Text analytics is constrained by the time horizon of the data collection and by the dictionaries available for financial jargon. The regression model, on the other hand, fails to adequately capture the clustering nature of financial indices, as market indicators and prices tend to be closer to each other in neighbouring periods than in periods further apart. The regression model is also unable to predict "black swan events", sudden market crashes fuelled by extreme sentiments such as panic; text analytics might be more helpful in capturing these.
Overall, the results generated are far from ideal, but employing these methods could help investors gauge the likely direction of the stock market, so that they can frame their investment strategies accordingly.